Matching

Data

We’ll load up the Lalonde data. Our goal here is to see if we can replicate or at least get close to the correct estimate in Lalonde’s 1986 paper using observational data. According to his model, the actual effect for the job training program was a $1,794 increase in the subject’s wages in 1978.

As usual a good starting point here is to get some basic descriptive stats on our variables.

lalonde|>
  select(treat, educ, married, nodegree, race)|>
  mutate(treat = factor(treat, labels=c("control", "treatment")),
         married = factor(married, labels=c('single','married')),
         nodegree = factor(nodegree, labels=c('degree','no degree'))
         )|>
  ggpairs(aes(fill=treat, color=treat), upper='blank') +
  theme_bw() +
  scale_fill_brewer(palette='Dark2') +
  scale_color_brewer(palette='Dark2')

lalonde|>
  select(treat, age, re74, re75, re78)|>
  mutate(treat = factor(treat, labels=c("control", "treatment")))|>
  ggpairs(aes(fill=treat, color=treat), upper='blank') +
  theme_bw() +
  scale_fill_brewer(palette='Dark2') +
  scale_color_brewer(palette='Dark2')

Regression

We’ll start with regression based estimates. The first model includes no controls, and the second includes controls for age, education, marital status, past income, degree status, and race:

# the uncontrolled regression
mod0<-lm(re78 ~ treat, data=lalonde)
mod1<-lm(re78 ~ treat + age + educ + married  + re74 + re75+ nodegree +race,data=lalonde)


modelsummary(list("treatment only"  =mod0, 
                        "treatment + controls" = mod1),
             estimate  = "{estimate}",  
             fmt = fmt_significant(),
             statistic ='conf.int',
             conf_level = .95,
             gof_omit = 'F|RMSE|R2$|AIC|Log.Lik.', # remove some model stats 
             # add a title
             title = "DV: Wages"
             )
tinytable_rdkbce8nluci39n9u6ym
DV: Wages
treatment only treatment + controls
(Intercept) 6984 -1174
[6276, 7693] [-5998, 3650]
treat -635 1548.2
[-1926, 655] [13.9, 3082.6]
age 13.0
[-50.8, 76.8]
educ 403.9
[91.9, 716.0]
married 407
[-959, 1772]
re74 0.2964
[0.1819, 0.4108]
re75 0.2315
[0.0261, 0.4370]
nodegree 260
[-1404, 1924]
racehispan 1740
[-261, 3740]
racewhite 1241
[-269, 2750]
Num.Obs. 614 614
R2 Adj. 0.000 0.135
BIC 12712.0 12666.1

Matching methods

Propensity Scores

You can get propensity score matches by using method="nearest" and distance='glm'

psm_matched <- matchit(treat ~ age + educ +  married + race +
                    nodegree + re74 + re75, 
                    data = lalonde,
                    method = "nearest",
                    distance= 'glm'
                  )

Once we have our matchit object, the Cobalt package gives us some options for balance checking. After examining the results, you might want to go back and re-specify the model.

bal.tab(psm_matched, thresholds = c(m = .1), un = TRUE)
Balance Measures
                Type Diff.Un Diff.Adj        M.Threshold
distance    Distance  1.7941   0.9739                   
age          Contin. -0.3094   0.0718     Balanced, <0.1
educ         Contin.  0.0550  -0.1290 Not Balanced, >0.1
married       Binary -0.3236  -0.0216     Balanced, <0.1
race_black    Binary  0.6404   0.3730 Not Balanced, >0.1
race_hispan   Binary -0.0827  -0.1568 Not Balanced, >0.1
race_white    Binary -0.5577  -0.2162 Not Balanced, >0.1
nodegree      Binary  0.1114   0.0703     Balanced, <0.1
re74         Contin. -0.7211  -0.0505     Balanced, <0.1
re75         Contin. -0.2903  -0.0257     Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         5
Not Balanced, >0.1     4

Variable with the greatest mean difference
   Variable Diff.Adj        M.Threshold
 race_black    0.373 Not Balanced, >0.1

Sample sizes
          Control Treated
All           429     185
Matched       185     185
Unmatched     244       0
bal.plot(psm_matched)

love.plot(psm_matched)

Access the matched data with the match.data function:

psm_data<-match.data(psm_matched)

head(psm_data)

Then we can use the resulting balanced data in a regression model:

psm_fit<-lm(re78 ~ treat + age + educ +  married + race +nodegree + re74 + re75, 
   data = psm_data,
   weight = weights )

Finally, we want to estimate our effects with cluster robust standard errors:

avg_comparisons(psm_fit,
                variables = "treat",
                vcov = ~subclass)

 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
     1345        719 1.87   0.0614 4.0 -64.2   2754

Term: treat
Type:  response 
Comparison: 1 - 0
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
modelsummary(list(
                  "psm" = psm_fit
                  ),
             estimate  = "{estimate}",  
             fmt = fmt_significant(),
             statistic ='conf.int',
             conf_level = .95,
             vcov =~subclass,
             gof_omit = 'F|RMSE|R2$|AIC|Log.Lik.', # remove some model stats 
             # add a title
             title = "DV: Wages"
             )
tinytable_d19sosys0xn8szsszt6r
DV: Wages
psm
(Intercept) -2582
[-9199, 4036]
treat 1345
[-69, 2759]
age 7.8
[-79.0, 94.6]
educ 602
[178, 1027]
married -158
[-1996, 1680]
racehispan 1533
[-491, 3558]
racewhite 469
[-1328, 2267]
nodegree 923
[-1392, 3239]
re74 0.0264
[-0.3012, 0.3539]
re75 0.221
[-0.109, 0.550]
Num.Obs. 370
R2 Adj. 0.025
BIC 7650.3
Std.Errors by: subclass

Coarsened Exact matching

For CEM, we’ll often want to set manually set some of “cutpoints”. There’s no set rules here. Visual inspection of the predictors can help, and the MatchIt package has some defaults that can be sensible, but this is partly a judgement call: what values of the independent variables seem like they should be grouped together?

By default, observations are matched exactly on factor variables. However, you can override this by setting passing a list of levels to the grouping argument:

cem_match <- matchit(treat ~ age + educ +  married + race +
                    nodegree + re74 + re75, data = lalonde,
                  method = "cem",
                  estimand ='ATE',
                  cutpoints =list(educ = c(0,9, 12, 14)),
                  grouping = list(race = list(c("white", "hispan"),
                                              c("black")))
                  )
summary(cem_match)

Call:
matchit(formula = treat ~ age + educ + married + race + nodegree + 
    re74 + re75, data = lalonde, method = "cem", estimand = "ATE", 
    cutpoints = list(educ = c(0, 9, 12, 14)), grouping = list(race = list(c("white", 
        "hispan"), c("black"))))

Summary of Balance for All Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
age              25.8162       28.0303         -0.2419     0.4400    0.0813
educ             10.3459       10.2354          0.0448     0.4959    0.0347
married           0.1892        0.5128         -0.7208          .    0.3236
raceblack         0.8432        0.2028          1.6708          .    0.6404
racehispan        0.0595        0.1422         -0.2774          .    0.0827
racewhite         0.0973        0.6550         -1.4080          .    0.5577
nodegree          0.7081        0.5967          0.2355          .    0.1114
re74           2095.5737     5619.2365         -0.5958     0.5181    0.2248
re75           1532.0553     2466.4844         -0.2870     0.9563    0.1342
           eCDF Max
age          0.1577
educ         0.1114
married      0.3236
raceblack    0.6404
racehispan   0.0827
racewhite    0.5577
nodegree     0.1114
re74         0.4470
re75         0.2876

Summary of Balance for Matched Data:
           Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
age              21.1665       20.7539          0.0451     0.8970    0.0144
educ             10.1082       10.1923         -0.0340     0.7899    0.0153
married           0.0619        0.0619         -0.0000          .    0.0000
raceblack         0.5876        0.5876          0.0000          .    0.0000
racehispan        0.1314        0.0940          0.1256          .    0.0375
racewhite         0.2809        0.3184         -0.0946          .    0.0375
nodegree          0.7423        0.7423         -0.0000          .    0.0000
re74            472.0782      742.1934         -0.0457     1.2640    0.0478
re75            466.2790      677.2163         -0.0648     1.1178    0.0549
           eCDF Max Std. Pair Dist.
age          0.1294          0.1133
educ         0.1186          0.3455
married      0.0000          0.0000
raceblack    0.0000          0.0000
racehispan   0.0375          0.2567
racewhite    0.0375          0.1933
nodegree     0.0000          0.0000
re74         0.2828          0.0847
re75         0.2699          0.1654

Sample Sizes:
              Control Treated
All            429.    185.  
Matched (ESS)   86.99   32.26
Matched        112.     82.  
Unmatched      317.    103.  
Discarded        0.      0.  

Now we can plot our results to check out how well the matching performed:

plot(cem_match, 
     type = "density", 
     interactive = TRUE, 
     which.xs = ~age +
married + re75)

bal.tab(cem_match, thresholds = c(m = .1), un = TRUE)
Balance Measures
               Type Diff.Un Diff.Adj    M.Threshold
age         Contin. -0.2419   0.0451 Balanced, <0.1
educ        Contin.  0.0448  -0.0340 Balanced, <0.1
married      Binary -0.3236  -0.0000 Balanced, <0.1
race_black   Binary  0.6404   0.0000 Balanced, <0.1
race_hispan  Binary -0.0827   0.0375 Balanced, <0.1
race_white   Binary -0.5577  -0.0375 Balanced, <0.1
nodegree     Binary  0.1114  -0.0000 Balanced, <0.1
re74        Contin. -0.5958  -0.0457 Balanced, <0.1
re75        Contin. -0.2870  -0.0648 Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         9
Not Balanced, >0.1     0

Variable with the greatest mean difference
 Variable Diff.Adj    M.Threshold
     re75  -0.0648 Balanced, <0.1

Sample sizes
                     Control Treated
All                   429.    185.  
Matched (ESS)          86.99   32.26
Matched (Unweighted)  112.     82.  
Unmatched             317.    103.  
bal.plot(cem_match)

love.plot(cem_match)

Finally, we’ll want to use the match.data function to extract the matched data and then estimate our model with the weights included:

cem_data<-match.data(cem_match)

cem_fit <- lm(re78 ~ 
            treat + age + educ + married + re74 +re75+
            nodegree+ race , data = cem_data, weights = weights)
avg_comparisons(cem_fit, variables = "treat", vcov = ~subclass)

 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
     1672       1079 1.55    0.121 3.0  -444   3787

Term: treat
Type:  response 
Comparison: 1 - 0
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
modelsummary(list(
                  "psm" = psm_fit,
                  "cem" = cem_fit
                  ),
             estimate  = "{estimate}",  
             fmt = fmt_significant(),
             statistic ='conf.int',
             conf_level = .95,
             vcov =~subclass,
             gof_omit = 'F|RMSE|R2$|AIC|Log.Lik.', # remove some model stats 
             # add a title
             title = "DV: Wages"
             )
tinytable_k5npb0hgr1ros5fz8zye
DV: Wages
psm cem
(Intercept) -2582 -4693
[-9199, 4036] [-13834, 4449]
treat 1345 1672
[-69, 2759] [-458, 3801]
age 7.8 106.3
[-79.0, 94.6] [-41.9, 254.5]
educ 602 608
[178, 1027] [14, 1201]
married -158 -2533
[-1996, 1680] [-5348, 282]
racehispan 1533 957
[-491, 3558] [-422, 2335]
racewhite 469 2391
[-1328, 2267] [224, 4559]
nodegree 923 826
[-1392, 3239] [-1886, 3538]
re74 0.0264 -0.0964
[-0.3012, 0.3539] [-0.7940, 0.6012]
re75 0.221 0.6872
[-0.109, 0.550] [-0.0556, 1.4300]
Num.Obs. 370 194
R2 Adj. 0.025 0.061
BIC 7650.3 3999.8
Std.Errors by: subclass by: subclass

Question

See if you can improve on the CEM matching used above.

# your code